Abstract

In this paper, we propose subspace alignment based domain adaptation of the
state-of-the-art RCNN based object detector [O]. The aim is to achieve high quality
object detection in novel, real-world target scenarios without requiring labels from the
target domain. While unsupervised domain adaptation has been studied for object
classification, it remains relatively unexplored for object detection. In subspace
based domain adaptation for objects, we need access to source and target subspaces for
the bounding box features. The absence of supervision (both labels and bounding boxes are
absent) makes the task challenging. We show that we can still adapt subspaces
that are localized to the object by obtaining detections from the RCNN detector
trained on the source and applied to the target. We then form localized subspaces from
these detections and show that subspace alignment based adaptation between these subspaces
yields improved object detection. The evaluation uses challenging real-world datasets,
with PASCAL VOC as the source and the validation set of the Microsoft COCO dataset as
the target, for various categories.


1 Introduction 


Most machine learning algorithms rest on the assumption that training and test instances
are sampled from a similar distribution. This assumption is often violated in real-world
scenarios, i.e. there is a high probability that training and test instances arise from
different distributions. This problem is well known in the research community as the
domain shift problem. Domain shift appears in various fields including language
processing, speech processing and computer vision. Various factors can cause it in
computer vision. For example, if object classifiers are trained on images taken with a
high quality DSLR camera and the test images are taken with a VGA camera, the classifiers
cannot be expected to perform well. Apart from differences in resolution, differences in
viewpoint, clutter and background can also cause domain shift. Indeed, the problem is
particularly pertinent to the computer vision community due to our reliance on
‘standard’ challenging datasets. Each dataset has its own bias, and results on one
dataset do not easily transfer to other datasets, as shown by Torralba and Efros in an
important work [HE]. Due to these factors, domain adaptation is becoming increasingly
important in computer vision. However, most of the work in this field revolves around
adapting a classifier for object recognition or classification, and little effort has
gone into adapting an object detector.

One contribution of this paper is an analysis of object detection performance between
two challenging standard object detection datasets, viz. Pascal VOC and Microsoft COCO,
using the state-of-the-art RCNN object detection technique. While one might assume that
convolutional neural networks trained on all the examples from the ImageNet dataset
would yield a detector that works well across datasets, we show that this is not the
case. Adapting object detection in the unsupervised setting is challenging. If we had
labelled observations from the other dataset, we could use fine-tuning to adapt the
convolutional neural network itself. Without access to supervision, we consider a recent
domain adaptation technique based on subspace alignment to align the source and target
feature subspaces for localized object detection bounding boxes. We further analyse our
method by considering the principal angles between the subspaces. Our evaluation
demonstrates that it is possible to obtain localized subspace adaptation for object
detection and that this adaptation improves the performance of an off-the-shelf object
detector.

The rest of the paper is organized as follows. In the next section we discuss related
work, and in section 3 we develop the main concepts of subspace learning and detection.
Section 4 describes the proposed method in detail. Section 5 presents the experimental
evaluation, including performance evaluation and detailed analysis. We conclude in
section 6.

2 Related Work 

The task of visual domain adaptation for object classification has been studied in
unsupervised and semi-supervised settings. In this section we briefly survey only the
domain adaptation techniques that are related to the present work. A recent extended
report [ffl] by Gopalan et al. surveys domain adaptation techniques for visual recognition.

Subspace based methods are commonly used for learning new feature representations
that are domain-invariant. They thus enable the transfer of classifiers from a source to a
specific target domain. A common approach behind these subspace based methods is to first
determine separate subspaces for source and target data. The data is then projected onto a
set of intermediate subspaces sampled along the geodesic path between the source and
target subspaces, with the aim of making the feature points domain invariant [□]. This
line of work has been extended through the geodesic flow kernel [□], source and target
subspace alignment [i] and a manifold alignment based approach [□].

In contrast to subspace based techniques, it is shown in [O] that the image
representation learned by convolutional neural networks on the ImageNet dataset can be
transferred to other tasks with relatively small datasets by fine-tuning the deep
network. However, this requires annotations for the target dataset. In very recent work,
a more sophisticated technique has been proposed by Zhang et al. [O], where a deep
transfer network learns a shared feature subspace that matches conditional
distributions. However, this is not applicable to detecting objects.

The above mentioned works are applicable to object classification. The problem of
domain adaptation for object detection has been studied to a lesser extent. One such work
applicable to object detection has been the adaptation of Deformable Part Models for
object detection. The adaptation of objects through transfer component analysis [HD] has
been proposed by Mirrashed et al. [0] specifically for the case of vehicles. In recent
work, a method has been proposed to adapt a fine-tuned convolutional neural network for
detection by treating it as a domain adaptation problem [IE0]. This technique has shown
interesting performance for large scale detection. However, it relies heavily on the
presence of a large number of pre-trained, fine-tuned detectors (200 categories). We do
not make such assumptions for our method.

Finally, our work is also motivated by the idea of hierarchical and iterative domain
adaptation. In [□], it has been proposed to adapt the hierarchy of features to exploit the
visual information. In [O], Anant et al. propose an iterative hierarchical subspace based
domain adaptation method to exploit the availability of additional structure in the label
space, i.e. hierarchy. However, these techniques are not applicable to detection.

3 Background 

The proposed approach builds upon the previously proposed subspace alignment based method 
[0] for visual domain adaptation to adapt the RCNN detector [O]. 

3.1 Subspace Alignment 

The subspace alignment based domain adaptation method consists of learning a
transformation matrix M that maps the source subspace to the target one [□]. Suppose we
have labelled source data S and unlabelled target data T. We normalize the data vectors
and perform PCA separately on the source and target data vectors. For each domain, the d
eigenvectors corresponding to the d largest eigenvalues are selected. We take these
eigenvectors as bases for the source and target subspaces, denoted by $X_s$ and $X_t$
respectively. We use a transformation matrix M to align the source subspace $X_s$ to the
target subspace $X_t$. The mathematical formulation of this problem is

$$F(M) = \|X_s M - X_t\|_F, \qquad M^* = \operatorname*{argmin}_M F(M). \tag{1}$$

Here $X_s$ and $X_t$ are matrices containing the d most important eigenvectors for source
and target respectively, and $\|\cdot\|_F$ is the Frobenius norm. The solution of eqn. 1 is
$M^* = X_s' X_t$, and hence the target aligned source coordinate system is
$X_a = X_s X_s' X_t$. Once we have the target aligned source coordinate system, we project
the source data onto it and train the classifier in this frame. At test time, target data
is projected onto the target subspace and the classifier score is computed.
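
The closed-form solution above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`pca_basis`, `subspace_alignment`, `alignment_cost`), the random toy data and the choice d = 5 are assumptions made for illustration, and normalization is reduced to mean-centering.

```python
import numpy as np

def pca_basis(data, d):
    """Return a (D, d) matrix whose columns are the top-d principal directions.

    Assumption: normalization is reduced to mean-centering in this sketch.
    """
    centered = data - data.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:d].T

def alignment_cost(xs, m, xt):
    """Frobenius-norm cost F(M) = ||X_s M - X_t||_F from eqn. 1."""
    return np.linalg.norm(xs @ m - xt)

def subspace_alignment(source, target, d):
    """Compute both PCA bases, the closed-form M* and the aligned source basis."""
    xs = pca_basis(source, d)
    xt = pca_basis(target, d)
    m = xs.T @ xt   # M* = X_s' X_t
    xa = xs @ m     # X_a = X_s X_s' X_t (target aligned source coordinates)
    return xs, xt, m, xa

# Toy data (hypothetical): 20-dimensional features with a shifted target domain.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 20))
target = rng.normal(size=(150, 20)) @ rng.normal(size=(20, 20))

xs, xt, m, xa = subspace_alignment(source, target, d=5)
source_proj = source @ xa   # source features used to train the classifier
target_proj = target @ xt   # target features scored at test time
```

Because the columns of $X_s$ are orthonormal, $M^* = X_s' X_t$ is the least-squares minimizer of eqn. 1, so any perturbation of `m` can only increase `alignment_cost`.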

3.2 RCNN-detector 

Convolutional neural networks (CNNs) and other deep learning based approaches have
improved object classification accuracy by a large margin. RCNN [O] uses the CNN
framework to bridge the gap between the object classification and object detection tasks.
The idea of that work is to see how well the results of convolutional neural networks on
the ImageNet task generalize to object detection on the PASCAL dataset. RCNN consists of
three modules. The first module generates category-independent selective search windows
[O] in an image. The second module extracts mid-level features for each proposed region
using a convolutional neural network previously trained on the ImageNet dataset. In the
third module, an SVM classifier is trained, treating all windows whose overlap with the
ground truth bounding box is less than a threshold λ as negative examples. Hard negative
examples are mined from these negatives during training. In the testing phase, 2000
selective search windows are again generated per image in fast mode. Each proposal is
warped and propagated forward through the pre-trained CNN to compute features. Then, for
each class, the learned class-specific SVM classifier is applied to the extracted
features to obtain a score for each proposal. Given the scores, we choose a threshold,
and the regions with scores above it are the candidates for a particular object category.
In the last step, greedy non-maximum suppression is applied to obtain accurate, specific
bounding boxes for that object category.
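
The final greedy non-maximum suppression step can be sketched as follows. This is a generic illustration rather than the RCNN code: the `[x1, y1, x2, y2]` box format, the function name `greedy_nms` and the overlap threshold of 0.3 are assumptions.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2] (assumed format).
    Returns the indices of the kept boxes, highest score first.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)  # indices sorted by decreasing score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        iw = np.maximum(0.0, np.minimum(x2[best], x2[rest]) - np.maximum(x1[best], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[best], y2[rest]) - np.maximum(y1[best], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop boxes that overlap the best box too strongly.
        order = rest[iou <= iou_threshold]
    return keep

# Two heavily overlapping boxes and one separate box (toy values).
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

In this toy case the second box overlaps the first with an IoU of 81/119, so it is suppressed and only the first and third boxes remain.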

4 Subspace Alignment for Adapting RCNN 

In this section we describe our approach to adapting the class-specific RCNN detector.
Building on the background provided in the previous section, we apply subspace alignment
based domain adaptation on top of the initial RCNN detector. Instead of using a single
subspace for the full source and target data, we postulate that using a different,
class-specific subspace for each class when adapting from source to target domain
improves object detection accuracy.


Algorithm 1 Subspace Alignment based Domain Adaptation for RCNN Detector

procedure SA-BASED RCNN ADAPTATION(Source Data S, Target Data T)
    for each image j ∈ Source and Target Images do
        Windows(j) ← ComputeSelectiveSearchWindows(j)
        feat(j) ← ComputeCaffeFeat(Windows(j))
    end for
    InitRCNNdetector ← TrainRCNNonSource(SourceData)
    for each class i ∈ ObjectClass do
        PosSrc(i) = () and PosTgt(i) = ()
        for each image j ∈ Source and Target Images do
            ol(j) = ComputeOverlap(gTBbox(j, i), Windows(j))          ▷ For source images
            PosSrc(i) = Stack(PosSrc(i), feat(j)(ol(j) > γ))          ▷ For source images
            score(i, j) = runInitRCNNdetector(image(j))              ▷ For target images
            PosTgt(i) = Stack(PosTgt(i), feat(j)(score(i, j) > σ))    ▷ For target images
        end for
        Xsource(i) ← PCA(PosSrc(i))
        Xtarget(i) ← PCA(PosTgt(i))
    end for
    for each class i ∈ ObjectClass do
        ProjectMat(i) ← SubspaceAlign(Xsource(i), Xtarget(i))
    end for
    AdaptedRCNNdetector ← TrainRCNNonSource(ProjectedSrcData)
    boxes ← runAdaptedRCNNdetector(ProjectedTgtData)
    predictBbox ← runNonMaximumSuppression(boxes)
    return predictBbox
end procedure



Indeed, a more specific subspace for each object category is expected to span more
accurately the space in which that category lies. Since we are dealing with the
unsupervised setting, we do not have access to bounding boxes for the target domain
data. The source subspace can easily be found using the bounding boxes available for the
source data. An important point to note here is that using only the ground truth bounding
boxes to build the class-specific subspace in the source domain results in overfitting:
the subspace obtained this way does not truly represent the space of the class but a
smaller, more specific space that is a subset of it. To avoid this problem, when building
the class-specific source subspaces we consider all bounding boxes whose overlap with the
ground truth bounding box is above a threshold γ. γ is chosen by cross validation on the
source data. Once we have the class-specific source subspaces, we also need
class-specific target subspaces in order to apply subspace alignment. The problem is that
no bounding boxes are available for the target data, so we cannot directly compute the
overlap of the search windows with the actual bounding boxes. To deal with this, we run
the RCNN detector, initially trained on the source dataset, on the target dataset. This
gives a score for every search window in the target images. For each class, we take all
search windows whose score is greater than a threshold σ as positive samples for subspace
generation. Again, σ is chosen by cross validation on the source dataset. Once we have
positive samples for the target subspace, we generate class-specific target subspaces and
apply subspace alignment to each class separately. The detector is trained separately for
each class in the target aligned source coordinate system. At test time, target data is
projected onto the target subspace and classified using the detector trained in the
target aligned source coordinate system. In the last step, greedy non-maximum suppression
is applied to predict the most accurate window. Hence, the full algorithm is a two step
process, as summarized in algorithm 1.
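
The two sample-selection rules described above (overlap with the ground truth for the source, detector score for the target) can be sketched as follows. The function names, the `[x1, y1, x2, y2]` box format and the toy features are assumptions made for illustration. The threshold values 0.7 and 0.4 are the ones reported in the experimental setup.

```python
import numpy as np

def iou_with_ground_truth(gt_box, windows):
    """Intersection-over-union of each window with one ground truth box.

    Boxes are [x1, y1, x2, y2] (assumed format).
    """
    windows = np.asarray(windows, dtype=float)
    gx1, gy1, gx2, gy2 = gt_box
    iw = np.clip(np.minimum(gx2, windows[:, 2]) - np.maximum(gx1, windows[:, 0]), 0, None)
    ih = np.clip(np.minimum(gy2, windows[:, 3]) - np.maximum(gy1, windows[:, 1]), 0, None)
    inter = iw * ih
    gt_area = (gx2 - gx1) * (gy2 - gy1)
    win_area = (windows[:, 2] - windows[:, 0]) * (windows[:, 3] - windows[:, 1])
    return inter / (gt_area + win_area - inter)

def select_source_positives(features, windows, gt_box, gamma):
    """Source side: keep features of windows whose overlap exceeds gamma."""
    return features[iou_with_ground_truth(gt_box, windows) > gamma]

def select_target_positives(features, scores, sigma):
    """Target side: keep features of windows whose detector score exceeds sigma."""
    return features[np.asarray(scores) > sigma]

# Toy example: three windows with 4-dimensional stand-in features.
windows = [[0, 0, 10, 10], [0, 0, 10, 8], [5, 5, 15, 15]]
features = np.arange(12, dtype=float).reshape(3, 4)

ious = iou_with_ground_truth([0, 0, 10, 10], windows)
src_pos = select_source_positives(features, windows, [0, 0, 10, 10], gamma=0.7)
tgt_pos = select_target_positives(features, scores=[0.9, 0.1, 0.5], sigma=0.4)
```

The stacked positives of each class would then be passed to PCA to obtain the class-specific source and target bases.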


5 Experiments 


In this section we describe our experimental setup and datasets, and then evaluate the
performance of our method. We analyze the contribution of our added components and
discuss its significance with the help of the similarity given by the principal angles
between subspaces. As Caffe features [□] have recently produced state-of-the-art results
for classification, we use Caffe features to evaluate our method. The initial set of
positive examples for the COCO dataset is obtained by running the R-CNN detector, trained
on PASCAL VOC 2012, on the validation set of COCO. Here, the PASCAL VOC 2012 train+val
set is the source and the validation set of COCO is the target. The validation set of
Microsoft COCO contains around 40,000 images and provides a challenging, substantial
dataset for comparison. While the PASCAL dataset mostly consists of iconic views of
objects, the COCO dataset is more challenging and contains clutter, background, occlusion
and multiple side views. These factors cause significant domain shift between the two
datasets and make them a suitable choice for evaluating our algorithm. The differences
between PASCAL VOC 2012 and COCO are discussed in great detail in [□]. Figure 1 shows
some images from both datasets to visualize the differences between them. It can be
observed from figure 1 that while most images in the PASCAL dataset contain a single
object instance and little background clutter, images in the COCO dataset generally
contain more than one instance and more background clutter.

(a) PASCAL (b) PASCAL (c) PASCAL (d) COCO (e) COCO (f) COCO
Figure 1: Image samples taken from COCO validation set and PASCAL VOC 2012 to show
domain shift


5.1 Experimental Setup 

The PASCAL dataset is simpler and smaller than COCO; hence it is used as the source
dataset and COCO as the target dataset. For target (COCO) subspace generation, examples
with a classifier score greater than a threshold of 0.4 are considered positive examples
for the target subspace. Non-maximum suppression is removed from this step for
consistency in the number of samples used for subspace generation. For the source dataset
(PASCAL), we compute the overlap of each object proposal region with the ground truth
bounding box and take the regions whose overlap with the ground truth bounding box is
greater than 0.7 as candidates for source subspace generation. The dimensionality of the
subspace is fixed at 100 for both the source and target datasets. Subspace alignment is
performed between the source and target subspaces, and the source data is projected onto
the target aligned source subspace for further training. Once the new detectors are
trained on the transformed features, the same detection procedure as in RCNN is applied
to the projected target image features.

5.2 Results 


In this section we present evidence of the performance of our method, compare our
results with other baselines and analyse the results. First we consider the statistical
difference between the PASCAL VOC 2012 and COCO validation set data. We run the RCNN
detector on both source and target data and plot histograms of the scores obtained for
both datasets. It can be observed from the histograms in figure 2 that the two datasets
are statistically dissimilar. Therefore, there is a need for domain adaptation. The
histogram evaluation was done in two settings: first category-wise, and second on the
full dataset jointly. The findings in both settings are similar and demonstrate the
statistical dissimilarity between these two datasets.



Figure 2: Histograms of scores. The first plot is for the COCO dataset and the second for
PASCAL VOC. Scores are along the x-axis and the number of object regions with that score
along the y-axis


We now evaluate our method on these two datasets. The first baseline is a simple RCNN
detector trained on PASCAL VOC 2012. We evaluate the performance of the RCNN detector on
all 20 categories of the PASCAL dataset. The mean average precision for this baseline is
23.60% without applying bounding box regression. The second baseline is RCNN - full image
transform, which means that when aligning the subspaces for source and target data we do
not use class-specific subspace alignment. We take the full target and source data at
once, learn separate subspaces for source and target data, apply the transformation to
the source domain subspace to align it with the target domain subspace, and finally train
the detector with projected features in the target aligned source coordinate system. This
baseline is similar to the work on unsupervised domain adaptation of object classifiers
using subspace alignment [i] discussed in section 3.1. The mean average precision using
RCNN - full image transform is 22.70%. We also compare our result with the result
obtained on the COCO validation set using the deformable part model trained on PASCAL
VOC 2012 [IZ3], as reported in [ED]. The mean average precision for the deformable part
model is reported as 16.9%. Our proposed approach improves on all the baselines in mean
average precision. The proposed approach, RCNN - local class specific transform, gives a
mean average precision of 25.43%, which is about 1.8 percentage points better than the
traditional RCNN detector and 8.5 percentage points higher than the deformable part model
based object detector. The complete result with category-wise performance is given in
table 1. We also show some accurate detections obtained using our method and visually
compare them with the detections of the traditional RCNN object detector on the same
images. Figure 3 shows some detections obtained using our proposed method. Figure 4
illustrates cases where our proposed method works but RCNN fails. Figure 5 contains some
images on which both detectors fail. In subsection 5.2.1, we discuss the intuition behind
our performance and explain it with the principal angle analysis of source and target
subspaces.



(a) Our method (b) RCNN (c) Our method (d) RCNN 
Figure 4: Examples where RCNN fails to perform but our method performs well 



Figure 5: Examples where both detectors fail. The 4th image is detected as human


5.2.1 Principal Angle Analysis 

As is evident from table 1, the proposed method outperforms the baselines for almost all
classes, with a few exceptions. We also analyse the category-wise similarity between
the source subspaces and target subspaces to explain this phenomenon. The similarity
between two subspaces is calculated by first finding the principal angles between them
and then taking the 2-norm of the vector of cosines of these principal angles:

$$d(X_s, X_t) = \|\cos\theta\|_2,$$

where $\theta$ is the vector of principal angles between subspaces $X_s$ and $X_t$.

No.  Class      RCNN -         RCNN -          Proposed, RCNN -   DPMv5-P
                No Transform   Full Transform  Local Transform
1    plane      36.72          35.44           40.1               35.1
2    bicycle    21.26          18.95           23.28              1.9
3    bird       12.50          12.37           13.63              3.7
4    boat       10.45          8.8             10.61              2.3
5    bottle     8.75           11.46           8.11               7
6    bus        37.47          38.12           40.64              45.4
7    car        20.6           20.4            22.5               18.3
8    cat        42.4           43.6            45.6               8.6
9    chair      9.6            6.3             8.8                6.3
10   cow        23.28          20.40           25.3               17
11   table      15.9           14.9            17.3               4.8
12   dog        28.42          32.72           31.3               5.8
13   horse      30.7           31.11           32.9               35.3
14   motorbike  31.2           29.05           34.6               25.4
15   person     27.8           28.8            30.9               17.5
16   plant      12.65          7.34            13.7               4.1
17   sheep      19.99          21.04           22.4               14.5
18   sofa       14.6           8.4             15.5               9.6
19   train      39.2           38.4            41.64              31.7
20   tv         28.6           26.4            29.9               27.9
     Mean AP    23.60          22.7            25.43              16.9

Table 1: Domain adaptation detection results on the validation set of the COCO dataset.
The RCNN - No Transform column reports the simple RCNN detector, trained on PASCAL VOC
2012, run on the COCO dataset. RCNN - Full Transform denotes the result obtained by
retraining the detector while using the full source and target images to align the
subspaces of the source and target datasets. RCNN - Local Transform denotes the category
specific subspace alignment method (the proposed method) for adapting the detector, and
DPMv5-P denotes the result of the deformable part model on the COCO dataset, trained on
PASCAL VOC 2012, as reported in [ED].

In the results we observe that our method improves very little or not at all for the
classes on which the traditional RCNN detector itself does not perform well. Our method
relies on the initial detections obtained by RCNN. When the initial detections are
better, we learn a better subspace and hence obtain a better subspace alignment. For
classes where the initial detections are not good, no meaningful subspace can be learned.
This can be observed for class no. 4 (boat), 5 (bottle) and 9 (chair) in figure 6: the
similarity between the source subspace and the target subspace of each of these classes
is very low. Hence, the performance of our method on these classes is not good. For the
rest of the classes, as can be seen from the diagonal blocks of figure 6, the target
subspaces are quite discriminative and the inter-class subspaces are not very similar.
This gives an improvement in the result. One more interesting point is that for class
no. 12, although the similarity between the source and target subspaces of that class is
good, this subspace is also very similar to the subspaces of other categories, as
indicated by the yellow blocks in the row and column corresponding to class 12. In such a
case, the full image transformation is expected to perform better for this class, and our
experimental results confirm this.

Figure 6: Color map of similarity between learned subspaces of different categories in
source and target dataset
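
The similarity measure used in this analysis can be sketched in a few lines of NumPy. The sketch assumes that both bases have orthonormal columns, in which case the cosines of the principal angles are the singular values of $X_s' X_t$. The function names and the toy bases are illustrative only.

```python
import numpy as np

def subspace_similarity(xs, xt):
    """2-norm of the cosines of the principal angles between two subspaces.

    xs, xt: (D, d) bases with orthonormal columns (assumption).
    """
    # For orthonormal bases, the singular values of xs' xt are cos(theta).
    cosines = np.linalg.svd(xs.T @ xt, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against rounding error
    return float(np.linalg.norm(cosines))

def similarity_matrix(source_bases, target_bases):
    """Pairwise similarities between per-class source and target bases."""
    return np.array([[subspace_similarity(xs, xt) for xt in target_bases]
                     for xs in source_bases])

# Toy bases in a 4-dimensional space (hypothetical).
eye = np.eye(4)
same = subspace_similarity(eye[:, :2], eye[:, :2])        # identical subspaces
orthogonal = subspace_similarity(eye[:, :2], eye[:, 2:])  # orthogonal subspaces
sim = similarity_matrix([eye[:, :2], eye[:, 2:]], [eye[:, :2], eye[:, 2:]])
```

Identical d-dimensional subspaces give the maximum value $\sqrt{d}$ and mutually orthogonal ones give 0, so a matrix of this kind has strong diagonal entries exactly when each class's source and target subspaces agree.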


6 Conclusion 

In this paper, we have presented a method for unsupervised domain adaptation of the
state-of-the-art RCNN object detector from a source to a target dataset. The main
challenge addressed is obtaining localized domain adaptation for adapting object
detectors. The adaptation is achieved with an approach based on subspace alignment that
efficiently aligns the source subspace with the target subspace. The proposed method
results in improved object detection. Thus, one can use RCNN, for instance, to detect
persons in novel settings with improved detection accuracy.

The main limitation of this approach is that it does not work well for classes where the
RCNN results are weak. This limitation can be addressed by partially relying on
supervision for classes where the source detection result is itself quite weak. Further,
once some supervision is available, it would be interesting to jointly learn domain
adaptation at the feature and subspace levels simultaneously.